Off-policy and on-policy reinforcement learning with the Tsetlin machine

Authors

Abstract

The Tsetlin Machine is a recent supervised learning algorithm that has obtained competitive accuracy and resource-usage results across several benchmarks. It has been used for convolution, classification, and regression, producing interpretable rules in propositional logic. In this paper, we introduce the first framework for reinforcement learning based on the Tsetlin Machine. Our framework integrates value iteration with the regression Tsetlin Machine as the value function approximator. To obtain accurate off-policy state-value estimation, we propose a modified Tsetlin Machine feedback mechanism that adapts to the dynamic nature of value iteration. In particular, we show that the Tsetlin Machine is able to unlearn and recover from the misleading experiences that often occur at the beginning of training. A key challenge we address is mapping the intrinsically continuous nature of state-value learning to the propositional Tsetlin Machine architecture, leveraging probabilistic updates. While accurate off-policy, this mechanism learns significantly slower than neural networks do on-policy. However, by introducing multi-step temporal-difference learning in combination with high-frequency propositional logic patterns, we are able to close the performance gap. Several gridworld instances document that our framework can outperform comparable neural network models, despite being based on simple one-level AND-rules in propositional logic. Finally, we show how the class of models learnt for the gridworld problem can be translated into a more understandable graph structure. The graph structure captures the state-value approximation and the corresponding policy found.
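
To make the combination of value iteration, multi-step temporal-difference targets, and a regression-style value approximator concrete, the following is a minimal sketch in Python. It uses a plug-in table-based regressor as a stand-in for the regression Tsetlin Machine and a toy gridworld; the class names, the random behaviour policy, and the update rule are illustrative assumptions, not the paper's implementation.

import random

class GridWorld:
    """5x5 grid; the agent starts at (0, 0) and receives +1 at (4, 4)."""
    actions = [(0, 1), (0, -1), (1, 0), (-1, 0)]
    def reset(self):
        return (0, 0)
    def step(self, state, action):
        r, c = state
        nr = min(max(r + action[0], 0), 4)
        nc = min(max(c + action[1], 0), 4)
        done = (nr, nc) == (4, 4)
        return (nr, nc), (1.0 if done else 0.0), done

class TableRegressor:
    """Placeholder value-function approximator with an incremental update."""
    def __init__(self, lr=0.1):
        self.values, self.lr = {}, lr
    def predict(self, state):
        return self.values.get(state, 0.0)
    def update(self, state, target):
        v = self.predict(state)
        self.values[state] = v + self.lr * (target - v)

def n_step_td(env, regressor, episodes=100, n=3, gamma=0.95):
    for _ in range(episodes):
        state, done, buffer = env.reset(), False, []
        while not done or buffer:
            if not done:
                action = random.choice(env.actions)  # random behaviour policy (off-policy data)
                next_state, reward, done = env.step(state, action)
                buffer.append((state, reward))
                state = next_state
            if len(buffer) == n or (done and buffer):
                s0, _ = buffer[0]
                G = sum(gamma**i * r for i, (_, r) in enumerate(buffer))  # n-step return
                if not done:
                    G += gamma**len(buffer) * regressor.predict(state)    # bootstrap from current estimate
                regressor.update(s0, G)
                buffer.pop(0)

env, regressor = GridWorld(), TableRegressor()
n_step_td(env, regressor)
print(regressor.predict((3, 4)))  # learned state-value estimate next to the goal

The multi-step return above is the same mechanism the abstract credits with closing the performance gap: longer returns propagate reward information faster than one-step bootstrapping, at the cost of higher variance.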

Related articles

On-Policy vs. Off-Policy Updates for Deep Reinforcement Learning

Temporal-difference-based deep reinforcement learning methods have typically been driven by off-policy, bootstrap Q-Learning updates. In this paper, we investigate the effects of using on-policy, Monte Carlo updates. Our empirical results show that for the DDPG algorithm in a continuous action space, mixing on-policy and off-policy update targets exhibits superior performance and stability comp...
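
A minimal sketch of the blended-target idea described above, assuming a simple interpolation between a Monte Carlo return and a one-step bootstrapped target; the coefficient beta and the function name are illustrative assumptions, not the paper's API.

def mixed_target(rewards, gamma, q_bootstrap, beta=0.5):
    """Blend the on-policy Monte Carlo return with an off-policy one-step target."""
    mc_return = sum(gamma**t * r for t, r in enumerate(rewards))   # on-policy
    one_step = rewards[0] + gamma * q_bootstrap                     # bootstrapped, off-policy
    return beta * mc_return + (1.0 - beta) * one_step

# Example: three remaining rewards in the episode, bootstrap estimate of 0.8.
print(mixed_target([1.0, 0.0, 1.0], gamma=0.9, q_bootstrap=0.8))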

Safe and Efficient Off-Policy Reinforcement Learning

In this work, we take a fresh look at some old and new algorithms for off-policy, return-based reinforcement learning. Expressing these in a common form, we derive a novel algorithm, Retrace(λ), with three desired properties: (1) it has low variance; (2) it safely uses samples collected from any behaviour policy, whatever its degree of “off-policyness”; and (3) it is efficient as it makes the b...
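
The core of a Retrace(λ)-style correction is a sum of temporal-difference errors weighted by truncated importance ratios. Below is a hedged sketch of that target for a single sampled trajectory; the argument names are illustrative, and the paper should be consulted for the exact operator and its convergence conditions.

def retrace_target(q_taken, q_exp_next, rewards, pi, mu, gamma=0.99, lam=1.0):
    """Corrected target for Q(x_0, a_0) from one trajectory.

    q_taken[t]    : current estimate Q(x_t, a_t)
    q_exp_next[t] : expectation of Q(x_{t+1}, .) under the target policy
    pi[t], mu[t]  : target / behaviour probabilities of the sampled action a_t
    """
    target, coef = q_taken[0], 1.0
    for t in range(len(rewards)):
        if t > 0:
            # Truncated importance weight c_t = lam * min(1, pi/mu) keeps variance bounded.
            coef *= lam * min(1.0, pi[t] / mu[t])
        delta = rewards[t] + gamma * q_exp_next[t] - q_taken[t]  # TD error
        target += (gamma**t) * coef * delta
    return target

print(retrace_target(q_taken=[0.5, 0.4], q_exp_next=[0.45, 0.0],
                     rewards=[0.0, 1.0], pi=[0.9, 0.8], mu=[0.5, 0.5]))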

Data-Efficient Off-Policy Policy Evaluation for Reinforcement Learning

In this paper we present a new way of predicting the performance of a reinforcement learning policy given historical data that may have been generated by a different policy. The ability to evaluate a policy from historical data is important for applications where the deployment of a bad policy can be dangerous or costly. We show empirically that our algorithm produces estimates that often have ...
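
The standard building block behind such estimators is importance sampling over historical trajectories. The sketch below shows an ordinary importance-sampling estimate of a target policy's return; the paper's estimator is more data-efficient, so this only illustrates the underlying idea, and the data layout is an assumption.

def importance_sampling_estimate(trajectories, gamma=0.99):
    """Each trajectory is a list of (pi_prob, mu_prob, reward) tuples."""
    estimates = []
    for traj in trajectories:
        weight, ret = 1.0, 0.0
        for t, (pi_p, mu_p, r) in enumerate(traj):
            weight *= pi_p / mu_p          # likelihood ratio of the whole trajectory
            ret += gamma**t * r            # discounted return observed in the data
        estimates.append(weight * ret)
    return sum(estimates) / len(estimates)

print(importance_sampling_estimate([[(0.8, 0.5, 1.0), (0.6, 0.5, 0.0)]]))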

Interpolated Policy Gradient: Merging On-Policy and Off-Policy Gradient Estimation for Deep Reinforcement Learning

Off-policy model-free deep reinforcement learning methods using previously collected data can improve sample efficiency over on-policy policy gradient techniques. On the other hand, on-policy algorithms are often more stable and easier to use. This paper examines, both theoretically and empirically, approaches to merging on- and off-policy updates for deep reinforcement learning. Theoretical resu...
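
A hedged sketch of the interpolation idea: blend an on-policy (likelihood-ratio) gradient estimate with an off-policy (critic-based) gradient estimate using a mixing coefficient. The coefficient name nu and the list-based vectors are assumptions made for illustration, not the paper's exact formulation.

def interpolated_gradient(on_policy_grad, off_policy_grad, nu=0.2):
    """Return an element-wise interpolation of the two gradient estimates."""
    return [nu * g_on + (1.0 - nu) * g_off
            for g_on, g_off in zip(on_policy_grad, off_policy_grad)]

print(interpolated_gradient([0.5, -1.0], [0.3, -0.4]))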

Off-Policy Shaping Ensembles in Reinforcement Learning

Recent advances in gradient temporal-difference methods make it possible to learn multiple value functions off-policy in parallel without sacrificing convergence guarantees or computational efficiency. This opens up new possibilities for sound ensemble techniques in reinforcement learning. In this work we propose learning an ensemble of policies related through potential-based shaping rewards. The ensembl...
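
The shaping mechanism the ensemble is built around can be sketched directly: each member sees the environment reward plus a potential-based term F(s, s') = gamma * phi(s') - phi(s), which is known to leave the optimal policy unchanged. The potential function used below is an illustrative assumption.

def shaped_reward(reward, state, next_state, phi, gamma=0.99, terminal=False):
    """Add the potential-based shaping term to the environment reward."""
    next_potential = 0.0 if terminal else phi(next_state)
    return reward + gamma * next_potential - phi(state)

# Example: a potential that prefers states closer to a goal at x = 10.
phi = lambda s: -abs(10 - s)
print(shaped_reward(0.0, state=4, next_state=5, phi=phi))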

Journal

Journal title: Applied Intelligence

Year: 2023

ISSN: 0924-669X, 1573-7497

DOI: https://doi.org/10.1007/s10489-022-04297-3